Today we will go through a series of steps that are familiar to many scientist doing empirical work. We will pretend to have collected some data1, clean them, visualize them, and analyze them using common statistical models. We will do all these things with the tidyverse, a collection of R packages for data manipulation and plotting that aims at being easily readable not only for machines, but also for humans2. These operations will be run in an environment that ensures computational reproducibility: if you have the initial data set, you will be able to reproduce the final results with a mouse click.
On your computer, you should already have installed R and RStudio (if not, follow the instructions for your operating system here and here).
One possible way to work on a project is to open a new .R file and write your code there. However, there are a few disadvantages:
setwd(), e.g., setwd("NAME/OF/YOUR/DIRECTORY")cat("\014") to clear the consolerm(list = ls()) to clear the environment (but it does not unload active packages!)You could do all those things, but… should you?
Jenny Bryan explains the reasons (without threats to commit arson) in this blog post. The main idea is to package code in a self-contained environment that everyone (including your future self) can run without changing their own environment.
Following Jenny Brian’s suggestion (and to avoid her wrath) we can create a project directly from RStudio:
If you check the directory you just created, you will see a file called YOURNAME_2020_MPILeipzig_intro_tidyverse.Rproj. If your RStudio session is still open, look at the top of the active window: the R Project is already active.
It is time to install the first package of this session: here3. For a brief overview of what the package can do, see this GitHub repository (authored by Jenny Bryan!).
Once the package is installed, we can load it:
This package is very useful to keep raw data, analysis scripts, preprocessed data, and reports in separate subdirectories (which is what you should always aim for). In which directory are we now?
## [1] "/home/aschetti/Documents/Projects/2020_MPILeipzig_intro_tidyverse"
We are now in the directory of the .Rproj file. However, our data are in a subdirectory called raw-data. here allows you to link to that directory without actually going there.
## [1] "/home/aschetti/Documents/Projects/2020_MPILeipzig_intro_tidyverse/raw-data"
In other words, we are still inside our R Project directory (/home/aschetti/Documents/Projects/2020_MPILeipzig_intro_tidyverse) but we can load and save files inside the respective subfolders.
Let’s verify that the data we need are actually in the raw-data subdirectory:
## [1] "MixedAttitude.dat"
Yes, the file MixedAttitude.dat is what we are going to play with today.
The data are in .dat format… how can we load them?4 The function read_csv from the readr package (part of the tidyverse) can load this kind of tab-separated files:
## Parsed with column specification:
## cols(
## ssj = col_double(),
## sex = col_double(),
## beerpos = col_double(),
## beerneg = col_double(),
## beerneut = col_double(),
## winepos = col_double(),
## wineneg = col_double(),
## wineneut = col_double(),
## waterpos = col_double(),
## waterneg = col_double(),
## waterneu = col_double()
## )
For additional details on the arguments of this function, type ?read_csv in the console.
Now the data are in a tibble, the equivalent of a data.frame in the tidyverse. Type att (the name of the data set in our environment) in the console to see the first cases.
## # A tibble: 22 x 11
## ssj sex beerpos beerneg beerneut winepos wineneg wineneut waterpos
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 6 5 38 -5 4 10
## 2 2 1 43 30 8 20 -12 4 9
## 3 3 1 15 15 12 20 -15 6 6
## 4 4 1 40 30 19 28 -4 0 20
## 5 5 1 8 12 8 11 -2 6 27
## 6 6 1 17 17 15 17 -6 6 9
## 7 7 1 30 21 21 15 -2 16 19
## 8 8 1 34 23 28 27 -7 7 12
## 9 9 1 34 20 26 24 -10 12 12
## 10 10 1 26 27 27 23 -15 14 21
## # … with 12 more rows, and 2 more variables: waterneg <dbl>, waterneu <dbl>
To see the full data set, click on att in the Environment window (by default on the upper right corner).
These are the data of 22 teenagers who saw neutral, positive, or negative advertisement of beer, wine, and water in 3 separate sessions. They were asked to rate the drinks on a scale ranging from –100 (dislike very much) through 0 (neutral) to +100 (like very much). Researchers wanted to reduce binge drinking in teenagers, so they hoped that pairing negative imagery with alcoholic beverages would lower these ratings5.
When the data set is big, it is useful to have a general idea of what it contains (tip: use the function glimpse):
## Rows: 22
## Columns: 11
## $ ssj <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ beerpos <dbl> 1, 43, 15, 40, 8, 17, 30, 34, 34, 26, 1, 7, 22, 30, 40, 15, …
## $ beerneg <dbl> 6, 30, 15, 30, 12, 17, 21, 23, 20, 27, -19, -18, -8, -6, -6,…
## $ beerneut <dbl> 5, 8, 12, 19, 8, 15, 21, 28, 26, 27, -10, 6, 4, 3, 0, 4, 9, …
## $ winepos <dbl> 38, 20, 20, 28, 11, 17, 15, 27, 24, 23, 28, 26, 34, 32, 24, …
## $ wineneg <dbl> -5, -12, -15, -4, -2, -6, -2, -7, -10, -15, -13, -16, -23, -…
## $ wineneut <dbl> 4, 4, 6, 0, 6, 6, 16, 7, 12, 14, 13, 19, 14, 21, 19, 7, 12, …
## $ waterpos <dbl> 10, 9, 6, 20, 27, 9, 19, 12, 12, 21, 33, 23, 21, 17, 15, 13,…
## $ waterneg <dbl> -14, -10, -16, -10, 5, -6, -20, -12, -9, -6, -2, -17, -19, -…
## $ waterneu <dbl> -2, -13, 1, 2, -5, -13, 3, 2, 4, 0, 9, 5, 0, 4, 2, 8, 10, 8,…
Looking at the data, two things stand out:
We need to eliminate these two participants from the data. Let’s start from participant #98.
Here we keep all participants that are not participant #98 (!= operator). The symbol %>% is the pipe operator, which allows to serially concatenate functions applied to the same data frame. Read it as “and then”. In the example above, we called the data frame att and then applied the filter function to discard participant #98.
How to verify that the filtering worked? You could click on att in the Environment window and look for participant #98, but it can be tedious with big data sets. An easy and flexible way is to check if the filtered data set still contains participant #98:
## # A tibble: 0 x 11
## # … with 11 variables: ssj <dbl>, sex <dbl>, beerpos <dbl>, beerneg <dbl>,
## # beerneut <dbl>, winepos <dbl>, wineneg <dbl>, wineneut <dbl>,
## # waterpos <dbl>, waterneg <dbl>, waterneu <dbl>
The tibble is empty, confirming the deletion.
Now let’s think of participant #99.
The code above allows us to keep all rows whose values of the variable waterneu are not NAs. Did it work?
## # A tibble: 0 x 11
## # … with 11 variables: ssj <dbl>, sex <dbl>, beerpos <dbl>, beerneg <dbl>,
## # beerneut <dbl>, winepos <dbl>, wineneg <dbl>, wineneut <dbl>,
## # waterpos <dbl>, waterneg <dbl>, waterneu <dbl>
It worked! We could also apply both filters simultaneously using the & (AND) operator:
However, with large data sets, it is unfeasible to visually check all values of all variables of all participants to see if something went wrong during data collection. A better strategy is to use the knowledge we already have about the experimental design. In this case, we know that ratings were collected via scales ranging from -100 to +100, so anything below -100 or above +100 should not be analyzed. Here we decide to only keep participants whose values of the variable waterneu are within the aforementioned range:
This strategy allows us to simultaneously get rid of both participant #98 and #99.
## # A tibble: 0 x 11
## # … with 11 variables: ssj <dbl>, sex <dbl>, beerpos <dbl>, beerneg <dbl>,
## # beerneut <dbl>, winepos <dbl>, wineneg <dbl>, wineneut <dbl>,
## # waterpos <dbl>, waterneg <dbl>, waterneu <dbl>
We only need a subset of variables in this data set. Researchers were interested in reducing binge drinking, so we will focus our attention on negative imagery and one alcoholic beverage. We choose beer. We also need control conditions, i.e., neutral imagery and water.
In the next code snippet, we simultaneously select and rename the variables we want to keep.
att_filter_select <-
att_filter %>%
dplyr::select( # there are several functions named 'select', so we need to explicitly call the one from package 'dplyr'
participant = ssj, # new name = old name
sex,
beer_negative = beerneg,
beer_neutral = beerneut,
water_negative = waterneg,
water_neutral = waterneu
)
head(att_filter_select, n = 5) # show only first 5 rows## # A tibble: 5 x 6
## participant sex beer_negative beer_neutral water_negative water_neutral
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 6 5 -14 -2
## 2 2 1 30 8 -10 -13
## 3 3 1 15 12 -16 1
## 4 4 1 30 19 -10 2
## 5 5 1 12 8 5 -5
Let’s inspect the variable sex:
## [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
A string of 1s and 2s. It’s easy to get a glimpse of the unique values of this variable because there are not many observations. When you have more observations, better use the function unique:
## [1] 1 2
What type of variable is it?
## [1] "numeric"
Class numeric has numeric values (i.e., double precision floating point numbers).
There are many variable types in R. An interesting one is factor, i.e., categorical variables. In our case, it would make sense to consider sex as a factor with two levels, male and female. Let’s transform it:
# Field et al. do not specify the coding,
# so we will assume that 1 is female and 2 is male
att_filter_select_recode <-
att_filter_select %>%
mutate( # create new variables
gender = recode( # the new variable 'gender' is a recoding of...
factor(sex), # ... variable 'sex', coerced as factor
"1" = "female", "2" = "male" # assign labels to factor levels
)
)
head(att_filter_select_recode, n = 5)## # A tibble: 5 x 7
## participant sex beer_negative beer_neutral water_negative water_neutral
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 6 5 -14 -2
## 2 2 1 30 8 -10 -13
## 3 3 1 15 12 -16 1
## 4 4 1 30 19 -10 2
## 5 5 1 12 8 5 -5
## # … with 1 more variable: gender <fct>
Now we have two variables with redundant information… let’s get rid of sex:
att_filter_select_recode <- # overwrite
att_filter_select_recode %>%
dplyr::select(-sex) # discard unused variables with '-'
head(att_filter_select_recode, n = 5)## # A tibble: 5 x 6
## participant beer_negative beer_neutral water_negative water_neutral gender
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 1 6 5 -14 -2 female
## 2 2 30 8 -10 -13 female
## 3 3 15 12 -16 1 female
## 4 4 30 19 -10 2 female
## 5 5 12 8 5 -5 female
Data frames can be in wide or long format:
Our data are now in wide format. However, the packages we are going to use today need data in long format. Let’s convert from wide to long:
att_filter_select_recode_long <-
att_filter_select_recode %>%
pivot_longer(
cols = c(beer_negative, beer_neutral, water_negative, water_neutral), # variables to be stacked
names_to = "condition", # name of new variable with all condition levels
values_to = "ratings" # name of new variable with all values
)
head(att_filter_select_recode_long, n = 5)## # A tibble: 5 x 4
## participant gender condition ratings
## <dbl> <fct> <chr> <dbl>
## 1 1 female beer_negative 6
## 2 1 female beer_neutral 5
## 3 1 female water_negative -14
## 4 1 female water_neutral -2
## 5 2 female beer_negative 30
The column ratings contains the values of our dependent variable, whereas condition contains all our conditions.
This experiment has two independent variables, drink and imagery. In our analysis, we wish to know the separate contribution of these two independent variables as well as their interaction. So, we need to separate the variable condition into 2 columns:
att_filter_select_recode_long_sep <-
att_filter_select_recode_long %>%
separate(condition, c("drink", "imagery"), # separate 'condition' into 'drink' and 'imagery'
remove = TRUE # remove original column
) %>%
# over-write participant, drink, and imagery
# in order to convert them to factors
mutate(
participant = as_factor(participant),
drink = as_factor(drink),
imagery = as_factor(imagery)
)
head(att_filter_select_recode_long_sep, n = 5)## # A tibble: 5 x 5
## participant gender drink imagery ratings
## <fct> <fct> <fct> <fct> <dbl>
## 1 1 female beer negative 6
## 2 1 female beer neutral 5
## 3 1 female water negative -14
## 4 1 female water neutral -2
## 5 2 female beer negative 30
The original data are saved in a .dat file. Instead, we will save our processed data as .csv.
Thanks to the versatility of the tidyverse (especially the pipe operator), all the above operations can be performed in one go:
att_filter_select_recode_long_sep_onego <-
# load data
read_csv(
here(
"raw-data",
"MixedAttitude.dat"
)
) %>%
# filter cases
filter(
., # this point refers to the current active tibble (it can be omitted)
waterneu >= -100 & waterneu <= +100
) %>%
# select and rename variables
dplyr::select(
.,
participant = ssj,
sex,
beer_negative = beerneg,
beer_neutral = beerneut,
water_negative = waterneg,
water_neutral = waterneu
) %>%
# recode variable
mutate(
gender = recode(
factor(sex),
"1" = "female", "2" = "male"
)
) %>%
# delete variable
dplyr::select(
.,
-sex
) %>%
# pivot
pivot_longer(
.,
cols = c(beer_negative, beer_neutral, water_negative, water_neutral),
names_to = "condition",
values_to = "ratings",
) %>%
# separate variable in two
separate(
.,
condition, c("drink", "imagery"),
remove = TRUE
) %>%
# convert variables to factors
mutate(
.,
participant = as_factor(participant),
drink = as_factor(drink),
imagery = as_factor(imagery)
)
# save to .csv
# to be done separately, or the conversion to factor will not succeed
# (i.e., values in 'participant', 'drink', and 'imagery' will still be characters)
write_csv(
att_filter_select_recode_long_sep_onego,
here("tidy-data", "data_attitude_onego.csv")
)
head(att_filter_select_recode_long_sep_onego, n = 5)## # A tibble: 5 x 5
## participant gender drink imagery ratings
## <fct> <fct> <fct> <fct> <dbl>
## 1 1 female beer negative 6
## 2 1 female beer neutral 5
## 3 1 female water negative -14
## 4 1 female water neutral -2
## 5 2 female beer negative 30
Are the two data sets really identical? Verify with the all_equal function:
## [1] TRUE
It is often required to provide summary statistics of the data. How to do it in tidyverse?
summary_att_filter_select_recode_long_sep_onego <-
att_filter_select_recode_long_sep_onego %>%
group_by(drink, imagery) %>% # group according to conditions
summarize(
n = n(), # number of observations
mean = mean(ratings), # mean
sd = sd(ratings), # standard deviation
sem = sd / sqrt(n), # standard error of the mean
min = min(ratings), # range (min)
max = max(ratings) # range (max)
) %>%
ungroup() %>% # better ungroup, to avoid possible conflicts later on
print() # another way of displaying the results in console## `summarise()` regrouping output by 'drink' (override with `.groups` argument)
## # A tibble: 4 x 8
## drink imagery n mean sd sem min max
## <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beer negative 20 4.45 17.3 3.87 -19 30
## 2 beer neutral 20 10 10.3 2.30 -10 28
## 3 water negative 20 -9.2 6.80 1.52 -20 5
## 4 water neutral 20 2.35 6.84 1.53 -13 12
Do the following operations in one go:
MixedAttitude.dat)sex to a categorical variable (named gender) with 2 levels:
sex, beerpos, beerneg, beerneut, winepos, and waterposrename, rename the variables you kept:
drink and imagery)participant, gender, drink, and imagery to factorsWhen you have completed these operations, save the data set as data_attitude_wine.csv in the subfolder tidy-data.
Plotting is one of the most satisfying things to do in R, especially if you use the package ggplot2… part of the tidyverse! In a nutshell, ggplot2 allows you to build plots iteratively using a series of layers. You start with a data set and specify its aesthetics. Later, you can add layers with annotations, statistical summaries, and so on.
Let’s start by creating a basic bar plot. We will use the data frame with the summary statistics that we just created (summary_att_filter_select_recode_long_sep_onego).
barplot_basic <- # the plot is saved in this variable
ggplot(
summary_att_filter_select_recode_long_sep_onego, # data
aes(
x = drink, # 'drink' variable on x-axis
y = mean, # mean ratings on y-axis
fill = imagery # different fill colors for different levels of 'drink' variable
)
)
barplot_basic # visualize the plotThe data are not displayed… why? The ggplot function includes data (what you want to visualize) and aesthetic mappings (e.g., which variable should be represented on the x-axis?), but the visualization itself is carried out by layers, for example geometric elements (e.g., points or lines). For more details, check out ggplot2: Elegant Graphics for Data Analysis.
Since we want to create a bar graph, we must use geom_bar.
barplot_1 <-
barplot_basic +
geom_bar(stat = "identity") # use the values in the data frame (no transformations)
barplot_1What the hell is that?!? Oh no, it’s a stacked bar graph! We don’t want that! How can we get separate bars for negative and neutral imagery?
barplot_2 <-
barplot_basic +
geom_bar(
stat = "identity",
position = position_dodge() # bars are next to each other
)
barplot_2Better. Let’s add outlines.
barplot_3 <-
barplot_basic +
geom_bar(
stat = "identity",
position = position_dodge(),
color = "black", # black outlines
size = 1 # line thickness
)
barplot_3Something is missing… ah, error bars! Display the standard deviation with geom_errorbar.
Note that now we can overlay this new geom over barplot_3, because we don’t need to modify geom_bar anymore.
barplot_4 <-
barplot_3 +
geom_errorbar(
aes(
# standard deviation below and and above the mean
ymin = mean - sd,
ymax = mean + sd
),
width = .2, # width of the error bars
position = position_dodge(.9) # position (centered on the bar)
)
barplot_4These colors are terrible… let’s use a better-looking color palette. The package viridis uses colors that are easier to distinguish for people with color blindness.
barplot_5 <-
barplot_4 +
scale_fill_viridis(
option = "viridis", # see ?scale_color_viridis for other color palettes
discrete = TRUE # map colors to discrete values
)
barplot_5Let’s add a final cosmetic touch.
barplot_6 <-
barplot_5 +
scale_y_continuous( # y-axis...
"", # no title
limits = c(-16, 16), # min/max values
breaks = seq(-16, 16, 4) # tick marks
) +
geom_hline( # horizontal lines...
yintercept = seq(-16, 16, 4), # ... following tick marks
linetype = "dotted",
colour = "#999999",
size = .8,
alpha = .5
) +
ggtitle("Bar Plot") + # plot title
theme_classic(base_size = 18) + # classic theme (e.g., no gray background) and text size
# additional elements to overwrite the defaults from theme_classic
theme(
plot.title = element_text(
size = 24, # title: text size
hjust = .5 # title: centered
),
legend.position = c(.9, .9) # legend position (upper right corner)
)
barplot_6No matter how pretty a bar graph is, it remains a suboptimal way of displaying your data.
A better solution is to use RDI plots.
RDI plots display Raw data, Descriptives, and Inferential statistics altogether. Nothing is hidden by useless bars, everything is shown so that readers can make up their minds. For example, the plot6 below shows:
To build it, we need to install ggpirate, which provides handy functions to create violin plots.
As you can see, we installed ggpirate in a different way: we used the function install_github from the package devtools. That’s because ggpirate is on GitHub but not on CRAN, the official repository for R packages. Also, note that we did not explicitly load the whole devtools library, but we simply picked out the function we needed.
Don’t forget to load this newly installed package 😉
Now we must use the whole dataset (saved as att_filter_select_recode_long_sep_onego), not the data frame with the summary statistics we used for the bar plots (summary_att_filter_select_recode_long_sep_onego).
Here is the code for the plot:
RDIplot <-
att_filter_select_recode_long_sep_onego %>%
ggplot(
aes(
x = imagery,
y = ratings,
color = drink, # different line colors for different levels of 'drink' variable
fill = drink # different fill colors for different levels of 'drink' variable
)
) +
geom_pirate(
bars = FALSE, # no bars
cis = TRUE, # confidence intervals
lines = TRUE, lines_params = list(color = "black", alpha = .4), # lines indicating means
points = TRUE, points_params = list(shape = 21, color = "black", size = 5, alpha = .4), # individual data points
violins = TRUE, violins_params = list(size = 1), # smoothed densities
show.legend = TRUE # legend
) +
scale_y_continuous( # y-axis
limits = c(-100, 100), # min-max
breaks = seq(-100, 100, 10) # tick marks
) +
coord_cartesian(ylim = c(-40, 40)) + # zoom into a specified portion of the plot
geom_hline( # horizontal lines...
yintercept = seq(-100, 100, 10), # ... following tick marks
linetype = "dotted",
colour = "#999999",
size = .8,
alpha = .5
) +
scale_fill_viridis( # color palette for all fills
option = "viridis",
discrete = TRUE
) +
ggtitle("RDI Plot") +
theme_minimal(base_size = 18) +
theme(
panel.grid = element_blank(), # plot background
legend.box.background = element_rect(color = "transparent"), # legend box background
legend.position = c(.9, .9), # legend position
plot.title = element_text(size = 26, hjust = .5) # resize and center title
)
RDIplotI hope it doesn’t feel like this:
How to draw an owl. Source: Know Your Meme.
If so, please let me know… I can explain each code snippet in more detail.
The plot can be saved with the following command:
Create a plot similar to what is shown above (with wine instead of beer) and save it in the subfolder images
The same as above, separately for female and male participants (hint: use facet_wrap)
Recreate a plot similar to #2 using the yarrr package (warning: saving the plot to file can be tricky 😉). Check out YaRrr! The Pirate’s Guide to R for help.
The aim of this study was to assess whether negative imagery would influence the likeness ratings of alcoholic beverages. It’s time to test this hypothesis statistically.
We will run a 2 (drink: water, beer) x 2 (imagery: neutral, negative) repeated measures ANOVA on these ratings.
For this demonstration I chose the package afex, authored by Henrik Singmann.
The code below shows how to run an ANOVA with this versatile and user-friendly package.
library(afex)
rmANOVA_att <-
aov_ez(
data = att_filter_select_recode_long_sep_onego, # data
id = "participant", # variable with subject identifier
dv = "ratings", # dependent variable
within = c("drink", "imagery"), # within-subject variables
type = 3 # type-III sums of squares (default in SPSS)
)
rmANOVA_att## Anova Table (Type 3 tests)
##
## Response: ratings
## Effect df MSE F ges p.value
## 1 drink 1, 19 252.84 8.97 ** .193 .007
## 2 imagery 1, 19 82.92 17.63 *** .134 <.001
## 3 drink:imagery 1, 19 26.66 6.75 * .019 .018
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
The results show:
The drink x imagery interaction is statistically significant… let’s run paired comparisons. afex uses functions included in the emmeans package.
# we didn't explicitly install these packages, but they are part of the dependencies of 'afex'
library(emmeans)
library(multcomp)
# set afex_options as below to use appropriate degrees of freedom
# for details, see https://cran.r-project.org/web/packages/afex/vignettes/afex_anova_example.html#post-hoc-contrasts-and-plotting
afex_options(emmeans_model = "multivariate")
posthoc_att <-
emmeans(rmANOVA_att, ~ imagery * drink) %>% # estimated marginal means
pairs(., # compare differences between estimated marginal means
test = adjusted("free") # "free": generalization of Bonferroni-Holm correction, taking into account correlations among model parameters
) %>%
as.glht() %>% # better p-value adjustment for multiple testing
summary() %>% # complete output
print()##
## Simultaneous Tests for General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## negative,beer - neutral,beer == 0 -5.550 2.604 -2.131 0.1421
## negative,beer - negative,water == 0 13.650 4.329 3.153 0.0194 *
## negative,beer - neutral,water == 0 2.100 4.997 0.420 0.9613
## neutral,beer - negative,water == 0 19.200 2.934 6.544 <0.001 ***
## neutral,beer - neutral,water == 0 7.650 3.034 2.521 0.0687 .
## negative,water - neutral,water == 0 -11.550 2.044 -5.652 <0.001 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Results show7:
Interestingly, beer ratings between negative and neutral imagery are not statistically significant (p = 0.142). This might indicate that the experimental manipulation did not have the intended effect, although such a conclusion cannot be inferred from the non-significant p-value alone.
The output of rmANOVA_att and posthoc_att is almost publication-ready… almost. Something is missing.
It is good practice to report effect sizes along with p-values, so that readers can make their own mind with respect to the importance of the observed effects. Even better, you could report confidence intervals around effect sizes, so that readers can have a clear picture of the precision of your estimation.
I particularly like the idea of bootstrapping effect sizes, a better approach when the data are known not to be normally distributed or when the distribution is unknown8. I will show you how to do it using the package bootES9.
install.packages("bootES")
library(bootES)
# we also install 'broom', a package that tidies up the results of statistical models
install.packages("broom")
library(broom)We will compute Hegdes’ g, an unbiased estimate of \(\delta\) (for details, see here).
Because our dependent variable consists of ratings collected from the same participants in different conditions (i.e., repeated measures), we must first manually compute the difference scores between our contrasts of interest (see the output of the paired comparisons above).
att_prep_bootES <-
att_filter_select_recode_long_sep_onego %>%
pivot_wider( # convert from long to wide format
id_cols = c("participant", "gender"),
names_from = c("drink", "imagery"),
values_from = "ratings",
names_sep = "_"
) %>%
mutate( # compute differences
beer_negativeVSbeer_neutral = beer_negative - beer_neutral,
beer_negativeVSwater_negative = beer_negative - water_negative,
beer_negativeVSwater_neutral = beer_negative - water_neutral,
beer_neutralVSwater_negative = beer_neutral - water_negative,
beer_neutralVSwater_neutral = beer_neutral - water_neutral,
water_negativeVSwater_neutral = water_negative - water_neutral
) %>%
dplyr::select(-c(beer_negative:water_neutral)) %>% # delete unused variables
pivot_longer( # re-convert to long format
cols = beer_negativeVSbeer_neutral:water_negativeVSwater_neutral,
names_to = "diff_conds",
values_to = "ratings"
)
head(att_prep_bootES, n = 5)## # A tibble: 5 x 4
## participant gender diff_conds ratings
## <fct> <fct> <chr> <dbl>
## 1 1 female beer_negativeVSbeer_neutral 1
## 2 1 female beer_negativeVSwater_negative 20
## 3 1 female beer_negativeVSwater_neutral 8
## 4 1 female beer_neutralVSwater_negative 19
## 5 1 female beer_neutralVSwater_neutral 7
Now we can calculate the standardized effect size for each difference scores. Using functions in the purrr package (part of the tidyverse!), we will create separate lists for each difference score and calculate bootstrapped Hegdes’s g for each of them.
set.seed(20200618) # set seed of random number generator, to ensure reproducibility
att_bootES <-
att_prep_bootES %>%
group_by(diff_conds) %>% # group according to difference scores
nest() %>% # create separate lists with difference scores
rename(diff_ratings = data) %>% # rename column with lists, to avoid confusion
mutate( # create new variable with bootstrapping results
bootstrap = map( # function to apply the 'bootES' function to each list
diff_ratings, # name of the column with lists
~ bootES( # function to apply to each list
., # tibble with all data
data.col = "ratings", # dependent variable within each list
R = 5000, # number of bootstrap samples
effect.type = "hedges.g", # type of effect size
ci.type = "bca", # bootstrap method
ci.conf = .95 # confidence level
),
data = .
),
tidy = map(bootstrap, # function to apply the 'tidy' function to each list
broom::tidy, # extract bootstrap results from each list into new columns
conf.int = TRUE) # display confidence intervals
) %>%
ungroup() # ungroup
# summary of results
att_HedgesG <-
att_bootES %>%
unnest(tidy) %>% # put bootstrap results into new columns
dplyr::select( # select and rename variables to report (also changing the order)
diff_conds,
"Hedges_g" = statistic,
"CI95_low" = conf.low,
"CI95_high" = conf.high,
"bias",
"std_error" = std.error
) %>%
print()## # A tibble: 6 x 6
## diff_conds Hedges_g CI95_low CI95_high bias std_error
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beer_negativeVSbeer_neutral -0.457 -1.05 -0.0321 -0.0344 0.255
## 2 beer_negativeVSwater_negative 0.677 0.286 1.24 0.0313 0.237
## 3 beer_negativeVSwater_neutral 0.0902 -0.358 0.527 -0.00248 0.223
## 4 beer_neutralVSwater_negative 1.40 0.945 2.31 0.0911 0.346
## 5 beer_neutralVSwater_neutral 0.541 0.124 1.08 0.0197 0.243
## 6 water_negativeVSwater_neutral -1.21 -2.89 -0.666 -0.150 0.542
The output (att_HedgesG) is a tibble containing the results of the bootstrapping procedure for each diff_conds.
This is a very flexible and powerful approach, which can be used to run all kinds of operations to subsets of your data. See an example here on how to run many linear models simultaneously.
Load data_attitude_wine.csv.
Run a 2 (drink) x 2 (imagery) repeated measures ANOVA on likeness ratings and display the results in console.
Run a 2 (gender) x 2 (drink) x 2 (imagery) mixed ANOVA on likeness ratings. Remember: gender is a between-subject factor!
Paired contrasts: test only difference ratings between female and male participants (hint: see example here).
You have now successfully used the tidyverse to load, preprocess, plot, and analyze your data. Even better, you have done it in an environment that facilitates reproducibility: you only need the raw data and the .Rmd file to reproduce the figures and the statistical results.
However, I said facilitate for a reason: even using this workflow, computational reproducibility is not guaranteed. For example, a specific R package might drastically change its functionalities so that the syntax is no longer valid, or a colleague may be using a different operating system which relies on different libraries.
There are ways to (almost) ensure computational reproducibility. The best option I have found so far is to create an R package with data, preprocessing and analysis scripts, and files that document the specific version of each package used for the computations. This package can be embedded into a container, a sort of virtual machine that contains an operating system with its own software, libraries, and configuration files. Learning these skills require some time but, in my opinion, it’s worth the time and effort. If you are interested, check out rrtools.
Another good (and low-effort 😄) option is to use the function sessionInfo, which lists important information related to your operating system and libraries as well as R and package versions.
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] broom_0.5.6 bootES_1.2 boot_1.3-25 multcomp_1.4-13
## [5] TH.data_1.0-10 MASS_7.3-51.6 survival_3.1-12 mvtnorm_1.1-1
## [9] emmeans_1.4.7 afex_0.27-2 lme4_1.1-23 Matrix_1.2-18
## [13] ggpirate_0.1.1 viridis_0.5.1 viridisLite_0.3.0 emo_0.0.0.9000
## [17] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4
## [21] readr_1.3.1 tidyr_1.1.0 tibble_3.0.1 ggplot2_3.3.1
## [25] tidyverse_1.3.0 here_0.1 devtools_2.3.0 usethis_1.6.1
## [29] knitr_1.28
##
## loaded via a namespace (and not attached):
## [1] minqa_1.2.4 colorspace_1.4-1 ellipsis_0.3.1
## [4] rio_0.5.16 rprojroot_1.3-2 estimability_1.3
## [7] fs_1.4.1 rstudioapi_0.11 farver_2.0.3
## [10] remotes_2.1.1 fansi_0.4.1 lubridate_1.7.9
## [13] xml2_1.3.2 codetools_0.2-16 splines_3.6.3
## [16] pkgload_1.1.0 jsonlite_1.6.1 nloptr_1.2.2.1
## [19] dbplyr_1.4.4 compiler_3.6.3 httr_1.4.1
## [22] backports_1.1.7 assertthat_0.2.1 cli_2.0.2
## [25] htmltools_0.5.0 prettyunits_1.1.1 tools_3.6.3
## [28] lmerTest_3.1-2 coda_0.19-3 gtable_0.3.0
## [31] glue_1.4.1 reshape2_1.4.4 Rcpp_1.0.4.6
## [34] carData_3.0-4 cellranger_1.1.0 vctrs_0.3.1
## [37] nlme_3.1-147 xfun_0.14 ps_1.3.3
## [40] openxlsx_4.1.5 testthat_2.3.2 rvest_0.3.5
## [43] lifecycle_0.2.0 statmod_1.4.34 zoo_1.8-8
## [46] scales_1.1.1 hms_0.5.3 sandwich_2.5-1
## [49] parallel_3.6.3 yaml_2.2.1 curl_4.3
## [52] memoise_1.1.0 gridExtra_2.3 stringi_1.4.6
## [55] highr_0.8 desc_1.2.0 pkgbuild_1.0.8
## [58] zip_2.0.4 rlang_0.4.6 pkgconfig_2.0.3
## [61] evaluate_0.14 lattice_0.20-41 labeling_0.3
## [64] processx_3.4.2 tidyselect_1.1.0 plyr_1.8.6
## [67] magrittr_1.5 R6_2.4.1 generics_0.0.2
## [70] DBI_1.1.0 pillar_1.4.4 haven_2.3.1
## [73] foreign_0.8-76 withr_2.2.0 abind_1.4-5
## [76] modelr_0.1.8 crayon_1.3.4 car_3.0-8
## [79] utf8_1.1.4 rmarkdown_2.2 grid_3.6.3
## [82] readxl_1.3.1 data.table_1.12.8 blob_1.2.1
## [85] callr_3.4.3 reprex_0.3.0 digest_0.6.25
## [88] xtable_1.8-4 numDeriv_2016.8-1.1 munsell_0.5.0
## [91] sessioninfo_1.1.1
I hope to have given you a glimpse of the versatility, readability, and user-friendliness of the tidyverse, and inspired you to work in a reproducible environment to benefit your future self and other colleagues. Enjoy your tidy, reproducible new life!
Thanks!
Hopefully you’re not familiar with pretending to collect data… if so, please tell your story by writing a book.↩︎
As you may imagine, today we won’t have time to cover all the amazing things you can do with these packages. Also, they are a gift that keeps on giving: I have been using them for a while and I keep discovering useful functions. If you want to learn more, read R for Data Science by Garrett Grolemund and Hadley Wickham.↩︎
It is good to know where your packages are. Type .libPaths() in the RStudio console to find out.↩︎
When you have no idea where to start, Google (or its privacy-oriented alternative StartPage) and StackOverflow are your friends!↩︎
This is a modified version of a data set included in the book Discovering Statistics Using R (Field, Miles & Field, 2012).↩︎
These RDI plots are sometimes called pirateplots, in honor of Nathaniel D. Phillips’s book YaRrr! The Pirate’s Guide to R. If you are curious, check out the yarrr package.↩︎
Another statistically significant finding (ratings of beer after neutral imagery vs. ratings of water after negative imagery) is not of interest for our research question, whereas the ratings difference between beer and water after neutral imagery is only close to the conventional threshold of p = 0.05. This result is not significant. It does not “approach significance”, there is no “trend toward significance”, or any other creative way to describe a non-significant finding in a way that will increase the chances of publication. For a fun take on the subject, see this blog.↩︎
If you prefer not to bootstrap your effect sizes, you can use the MBESS package (a useful tutorial can be found here).↩︎
A clear explanation of how to use bootES is provided by the authors of the package in their paper.↩︎
schettino@eur.nl